432 research outputs found

    GlyGen: Computational and informatics resources and tools for glycosciences research

    Get PDF
    Although ongoing technical advances are accelerating the pace and sophistication of data acquisition in glycoscience, the transformation of these data to glycobiology knowledge, insight, and understanding is slowed by the limited number of tools that facilitate their integration with biological knowledge. Thus, to fill these critical gaps, there is a need for a broadly relevant and sustainable glycoinformatics resource that can provide tools and data to address specific glycoscience questions. GlyGen is an integrated, extendable and cross-disciplinary glycoinformatics resource that will facilitate knowledge discovery in basic and translational glycobiology by integrating multidisciplinary data and knowledge from diverse resources. It will address glycobiology questions that can currently be answered only by extensive literature-based research and manual collection of data from disparate resources. The aims of the GlyGen project include integrating and exchanging up-to-date glycobiology-related information and data with partnering data sources such as EMBL-EBI, NCBI, UniProt, UniCarbKB, and others, and creating an intuitive web portal to search and browse glycoscience knowledge that will also include off-line data analysis, data exploration, and mining. Furthermore, the GlyGen project includes the development of essential new information resources, namely the Glycan Microarray Database, which will provide key information about the interactions of glycans with other biomolecules, and a Glycan Naming Ontology (GNOme), which facilitates interpretation of incomplete structural information in the context of biological functions. GlyGen's comprehensive data integration framework and valuable user feedback will provide unprecedented support for complex queries spanning diverse data types relevant to glycobiology, extending its scope beyond the mapping of glycan data to genes and proteins. The resource will be publicly available and will facilitate the sharing and dissemination of glycobiology knowledge. It will provide new opportunities for a systems-level understanding of glycobiology in disease and development, even for scientists who do not specialize in glycobiology.
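
    The kind of cross-resource query GlyGen aims to support can be sketched in a few lines. The snippet below is a hypothetical illustration only: the table layout, column names, and glycan identifiers are invented for this example and do not reflect GlyGen's actual data model; it simply shows a join of a protein table keyed by UniProt accession with a glycan attachment-site table to answer "which glycans are reported at which sites of a given protein".

        # Hypothetical sketch of a cross-resource join of the kind GlyGen supports.
        # The tables, columns, and glycan IDs below are invented for illustration only.
        import pandas as pd

        proteins = pd.DataFrame({
            "uniprot_acc": ["P14210", "P01857"],
            "gene": ["HGF", "IGHG1"],
        })

        glycosites = pd.DataFrame({
            "uniprot_acc": ["P01857", "P01857", "P14210"],
            "site": [297, 297, 402],
            "glycan_id": ["G0001XX", "G0002YY", "G0003ZZ"],  # placeholder identifiers
        })

        # Join protein annotations with reported glycosylation sites.
        merged = proteins.merge(glycosites, on="uniprot_acc", how="inner")
        print(merged[["gene", "site", "glycan_id"]])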

    High-performance integrated virtual environment (HIVE) tools and applications for big data analysis

    Get PDF
    The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is built on a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis.
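
    The scatter/gather pattern underlying this kind of distributed platform can be illustrated generically. The sketch below is not HIVE's API; it is a minimal stand-in showing per-sample analysis jobs fanned out across worker processes, with the function body and input file names left as placeholders.

        # Generic sketch of the scatter/gather pattern a platform like HIVE applies
        # to NGS workloads; this is not HIVE's API, only an illustration of the idea.
        from concurrent.futures import ProcessPoolExecutor

        def profile_variants(sample_path: str) -> dict:
            # Placeholder for an alignment + variant-profiling step on one sample.
            return {"sample": sample_path, "variants": 0}

        samples = ["sampleA.fastq", "sampleB.fastq", "sampleC.fastq"]  # hypothetical inputs

        if __name__ == "__main__":
            with ProcessPoolExecutor(max_workers=3) as pool:
                results = list(pool.map(profile_variants, samples))
            for record in results:
                print(record)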

    CoreGenes: A computational tool for identifying and cataloging "core" genes in a set of small genomes

    Get PDF
    BACKGROUND: Improvements in DNA sequencing technology and methodology have led to the rapid expansion of databases comprising DNA sequence, gene and genome data. Lower operational costs and heightened interest resulting from initial intriguing novel discoveries from genomics are also contributing to the accumulation of these data sets. A major challenge is to analyze and to mine data from these databases, especially whole genomes. There is a need for computational tools that look globally at genomes for data mining. RESULTS: CoreGenes is a global Java-based interactive data mining tool that identifies and catalogs a "core" set of genes from two to five small whole genomes simultaneously. CoreGenes performs hierarchical and iterative BLASTP analyses using one genome as a reference and another as a query. Subsequent query genomes are compared against each newly generated "consensus." These iterations lead to a matrix comprising related genes from this set of genomes, e.g., viruses, mitochondria and chloroplasts. Currently the software is limited to small genomes on the order of 330 kilobases or less. CONCLUSION: A computational tool, CoreGenes, has been developed to analyze small whole genomes globally. BLAST score-related and putatively essential "core" gene data are displayed as a table with links to GenBank for further data on the genes of interest. This web resource is available at http://pumpkins.ib3.gmu.edu:8080/CoreGenes or http://www.bif.atcc.org/CoreGenes
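
    The iterative "consensus" idea behind CoreGenes can be sketched compactly. The toy code below is a simplification under stated assumptions: it uses exact identifier intersection over invented gene sets, whereas the real tool defines matches by iterative BLASTP hits above a score threshold.

        # Conceptual sketch of CoreGenes-style iteration: start from a reference gene
        # set and intersect with each query genome in turn, carrying forward the
        # shrinking "consensus". Real CoreGenes uses BLASTP hits, not name equality;
        # the genomes below are invented placeholders.
        def core_genes(reference: set[str], queries: list[set[str]]) -> set[str]:
            consensus = set(reference)
            for genome in queries:
                # In the real tool this intersection is defined by BLASTP matches
                # above a score cutoff rather than identical identifiers.
                consensus &= genome
            return consensus

        genomes = [
            {"terL", "portal", "mcp", "holin", "endolysin"},
            {"terL", "portal", "mcp", "integrase"},
            {"terL", "portal", "mcp", "holin"},
        ]
        print(core_genes(genomes[0], genomes[1:]))  # -> {'terL', 'portal', 'mcp'}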

    Ferromagnetism in nanoscale BiFeO3

    Full text link
    A remarkably high saturation magnetization of ~0.4 mu_B/Fe along with a room-temperature ferromagnetic hysteresis loop has been observed in nanoscale (4-40 nm) multiferroic BiFeO_3, which in bulk form exhibits weak magnetization (~0.02 mu_B/Fe) and an antiferromagnetic order. The magnetic hysteresis loops, however, exhibit exchange bias as well as vertical asymmetry, which could be because of spin pinning at the boundaries between ferromagnetic and antiferromagnetic domains. Interestingly, as in bulk BiFeO_3, both the calorimetric and dielectric permittivity data in nanoscale BiFeO_3 exhibit characteristic features at the magnetic transition point. These features establish formation of a true ferromagnetic-ferroelectric system with a coupling between the respective order parameters in nanoscale BiFeO_3. Comment: 13 pages including 4 figures; pdf only; submitted to Appl. Phys. Lett.

    Advantages of distributed and parallel algorithms that leverage Cloud Computing platforms for large-scale genome assembly

    Get PDF
    Background: The transition to Next-Generation Sequencing (NGS) technologies has had numerous applications in plant, microbial and human genomics during the past decade. However, NGS sequencing trades high read throughput for shorter read length, increasing the difficulty of genome assembly. This research presents a comparison of traditional versus Cloud computing-based genome assembly software, using as examples the Velvet and Contrail assemblers and reads from the genome sequence of the zebrafish (Danio rerio) model organism. Results: The first phase of the analysis involved a subset of the zebrafish data set (2x coverage), and the best results were obtained using a K-mer size of 65, while Velvet was observed to take less time than Contrail to complete the assembly. In the next phase, genome assembly was attempted using the full dataset of 192x read coverage; Velvet failed to complete on a 256 GB memory compute server, while Contrail completed but required 240 hours of computation. Conclusion: This research concludes that the size of the dataset and the available computing hardware should be taken into consideration when deciding which assembler software to use. For a relatively small sequencing dataset, such as a microbial or small eukaryotic genome, the Velvet assembler is a good option. However, for larger datasets Velvet requires large-memory compute servers on the order of 1000 GB or more. On the other hand, Contrail is implemented using Hadoop, which performs the assembly in parallel across the nodes of a compute cluster. Furthermore, Hadoop clusters can be rented on demand from Cloud computing providers, and therefore Contrail can provide a simple and cost-effective way to assemble genomes from data generated at laboratories that lack the infrastructure or funds to build their own clusters.
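
    The coverage figures quoted above follow from simple arithmetic: coverage = (number of reads x read length) / genome size. The sketch below works through that calculation; the genome size and read length are approximate assumptions used only to illustrate the scale of the 2x and 192x datasets.

        # Back-of-the-envelope coverage arithmetic used when subsetting reads for
        # assembly trials: coverage = (number_of_reads * read_length) / genome_size.
        # The genome size and read length below are illustrative approximations.
        GENOME_SIZE = 1.4e9   # zebrafish genome, roughly 1.4 Gb (approximate)
        READ_LENGTH = 100     # bp, assumed short-read length

        def reads_for_coverage(target_coverage: float) -> int:
            return int(target_coverage * GENOME_SIZE / READ_LENGTH)

        print(f"~{reads_for_coverage(2):,} reads for 2x coverage")
        print(f"~{reads_for_coverage(192):,} reads for 192x coverage")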

    Multi-omics approaches to studying gastrointestinal microbiome in the context of precision medicine and machine learning

    Get PDF
    The human gastrointestinal (gut) microbiome plays a critical role in maintaining host health and has been increasingly recognized as an important factor in precision medicine. High-throughput sequencing technologies have revolutionized -omics data generation, facilitating the characterization of the human gut microbiome with exceptional resolution. The analysis of various -omics data, including metatranscriptomics, metagenomics, glycomics, and metabolomics, holds potential for personalized therapies by revealing information about functional genes, microbial composition, glycans, and metabolites. This multi-omics approach has not only provided insights into the role of the gut microbiome in various diseases but has also facilitated the identification of microbial biomarkers for diagnosis, prognosis, and treatment. Machine learning algorithms have emerged as powerful tools for extracting meaningful insights from complex datasets, and more recently have been applied to metagenomics data to efficiently identify microbial signatures, predict disease states, and determine potential therapeutic targets. Despite these rapid advancements, several challenges remain, such as key knowledge gaps, algorithm selection, and bioinformatics software parametrization. In this mini-review, our primary focus is metagenomics, while recognizing that other -omics can enhance our understanding of the functional diversity of organisms and how they interact with the host. We aim to explore the current intersection of multi-omics, precision medicine, and machine learning in advancing our understanding of the gut microbiome. A multidisciplinary approach holds promise for improving patient outcomes in the era of precision medicine, as we unravel the intricate interactions between the microbiome and human health.
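
    A minimal sketch of the machine-learning workflow described above is given below: a classifier trained on a microbial relative-abundance matrix to predict a disease label. The data are randomly generated placeholders, not a real cohort, and the model choice is an assumption made for illustration.

        # Minimal sketch: classify disease status from a taxa relative-abundance
        # matrix. The samples, taxa, and labels are synthetic placeholders.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n_samples, n_taxa = 100, 50
        X = rng.dirichlet(np.ones(n_taxa), size=n_samples)   # relative abundances per sample
        y = rng.integers(0, 2, size=n_samples)               # case/control labels (synthetic)

        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        scores = cross_val_score(clf, X, y, cv=5)
        print("cross-validated accuracy:", scores.mean())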

    Analysis of HIV-1 quasispecies sequences generated by High Throughput Sequencing (HTS) using HIVE

    Get PDF
    The high level of genetic variability of Human Immunodeficiency Virus type 1 (HIV-1) is caused by the low fidelity of its replication machinery. This leads to the evolution of swarm-like viral populations often described as quasispecies. High throughput sequencing (HTS) technology provides higher resolution than Sanger sequencing, enabling detection of low-frequency variant genomes. However, quasispecies analysis is still a challenge because of the systematic noise introduced by HTS technology, which increases type I errors (false positives) and obscures the underlying genetic diversity, which can in turn lead to mathematically unresolvable type II errors (false negatives). We have developed a pipeline using the tools in the High-performance Integrated Virtual Environment (HIVE), an HTS platform designed for big data analysis and management, to analyze the viral populations within each sample and identify their subtype classification and the recombination patterns of recombinants. RNA was extracted from 70 plasma samples of chronically HIV-1-infected patients. The 3’ half genomes of HIV-1 were amplified using RT-PCR, and PCR products were sequenced using Illumina MiSeq. The paired-end reads for each sample were assembled using Geneious software and analyzed for the presence of HIV-1 quasispecies using HIVE tools. Subtype analysis of the 70 samples using Geneious software identified 17 A1s, 4 Bs, 30 Cs, 1 D, 6 CRF02_AG, and 12 unique recombinant forms (URFs). Additionally, we found up to 178 ambiguous bases in the consensus sequences from 41 viral samples (58.6%), suggesting the presence of viral subpopulations. However, Geneious could not determine the major viral populations in each sample. We analyzed the same HTS reads using the HIV-1 quasispecies analysis pipeline and found one predominant population in 11 samples (15.7%), two to ten distinct populations in 45 samples (64.3%), 11-20 populations in 13 samples (18.6%), and 26 populations in one sample (1.4%). Interestingly, two equally major viral populations that were not detected by Geneious were identified in five samples (7.1%) by HIVE. The HIV-1 quasispecies analysis pipeline is reliable and more sensitive in its ability to identify distinct viral populations and recombination patterns not identified by the Geneious software.
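
    One step in quasispecies reporting of the kind summarized above is counting how many distinct subpopulations exceed a noise threshold in each sample. The sketch below illustrates only that tallying step; the haplotype frequencies and the 1% cutoff are invented for the example and do not reproduce the HIVE pipeline's actual statistics.

        # Illustrative sketch: given estimated haplotype frequencies for one sample,
        # count the distinct populations above a noise threshold. Frequencies and
        # the 1% cutoff are placeholders, not the pipeline's actual parameters.
        def count_populations(haplotype_freqs: dict[str, float], min_freq: float = 0.01) -> int:
            return sum(1 for freq in haplotype_freqs.values() if freq >= min_freq)

        sample = {"hap1": 0.48, "hap2": 0.45, "hap3": 0.05, "hap4": 0.008}
        print(count_populations(sample))  # -> 3: two co-dominant populations plus one minor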

    GeneOrder3.0: Software for comparing the order of genes in pairs of small bacterial genomes

    Get PDF
    BACKGROUND: An increasing number of whole viral and bacterial genomes are being sequenced and deposited in public databases. In parallel to the mounting interest in whole genomes, the number of whole-genome analysis software tools is also increasing. GeneOrder was originally developed to provide an analysis of genes between two genomes, allowing visualization of gene order and synteny comparisons of any small genomes; it was initially applied to virus, mitochondrion and chloroplast genomes. This has now been extended to small bacterial genomes of sizes less than 2 Mb. RESULTS: GeneOrder3.0 has been developed and validated successfully on several small bacterial genomes (ca. 580 kb to 1.83 Mb) archived in the NCBI GenBank database. It is an updated web-based "on-the-fly" computational tool allowing gene order and synteny comparisons of any two small bacterial genomes. Analyses of several bacterial genomes show that a large amount of gene and genome rearrangement occurs, as seen with earlier DNA software tools. This can be displayed at the protein level using GeneOrder3.0. Whole-genome alignments of genes are presented in both a table and a dot plot. This allows the detection of evolutionarily more distant relationships, since protein sequences are more conserved than DNA sequences. CONCLUSIONS: GeneOrder3.0 allows researchers to perform comparative analysis of gene order and synteny in genomes of sizes up to 2 Mb "on-the-fly." Availability: and
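
    The dot-plot output described above can be illustrated with a small sketch: each point marks a protein-level hit between gene i of genome A and gene j of genome B, so conserved gene order appears as diagonal runs. The hit list and plotting details below are invented for illustration and are not GeneOrder3.0's actual output format.

        # Sketch of a gene-order (synteny) dot plot: each point is a BLASTP hit
        # between gene i of genome A and gene j of genome B. The hit list is invented.
        import matplotlib.pyplot as plt

        hits = [(1, 1), (2, 2), (3, 3), (4, 10), (5, 11), (6, 12), (7, 4)]  # (index in A, index in B)

        xs, ys = zip(*hits)
        plt.scatter(xs, ys, s=20)
        plt.xlabel("Gene position in genome A")
        plt.ylabel("Gene position in genome B")
        plt.title("Gene-order (synteny) dot plot, illustrative data")
        plt.savefig("dotplot.png")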

    Computational identification of strain-, species- and genus-specific proteins

    Get PDF
    BACKGROUND: The identification of unique proteins at different taxonomic levels has both scientific and practical value. Strain-, species- and genus-specific proteins can provide insight into the criteria that define an organism and its relationship with close relatives. Such proteins can also serve as taxon-specific diagnostic targets. DESCRIPTION: A pipeline using a combination of computational and manual analyses of BLAST results was developed to identify strain-, species-, and genus-specific proteins and to catalog the closest sequenced relative for each protein in a proteome. Proteins encoded by a given strain are preliminarily considered to be unique if BLAST, using a comprehensive protein database, fails to retrieve (with an e-value better than 0.001) any protein not encoded by the query strain, species or genus (for strain-, species- and genus-specific proteins, respectively), or if BLAST, using the best hit as the query (reverse BLAST), does not retrieve the initial query protein. Results are manually inspected for homology if the initial query is retrieved in the reverse BLAST but is not the best hit. Sequences unlikely to retrieve homologs using the default BLOSUM62 matrix (usually short sequences) are re-tested using the PAM30 matrix, thereby increasing the number of retrieved homologs and increasing the stringency of the search for unique proteins. The above protocol was used to examine several food- and water-borne pathogens. We find that the reverse BLAST step filters out about 22% of proteins with homologs that would otherwise be considered unique at the genus and species levels. Analysis of the annotations of unique proteins reveals that many are remnants of prophage proteins, or may be involved in virulence. The data generated from this study can be accessed and further evaluated from the CUPID (Core and Unique Protein Identification) system web site (updated semi-annually) at . CONCLUSION: CUPID provides a set of proteins specific to a genus, species or a strain, and identifies the most closely related organism.
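
    The forward uniqueness test described above can be sketched directly from its stated criterion: a protein is provisionally taxon-specific if no hit outside the query taxon has an e-value better than 0.001. The hit records below are invented, and the sketch omits the reverse-BLAST check and manual inspection steps that the full pipeline applies.

        # Simplified sketch of the uniqueness criterion: no hit outside the query
        # taxon with e-value better than 0.001. Hit records are invented; the real
        # pipeline also applies reverse BLAST and manual review.
        E_CUTOFF = 1e-3

        def is_taxon_specific(hits: list[dict], query_taxon: str) -> bool:
            return not any(h["taxon"] != query_taxon and h["evalue"] < E_CUTOFF for h in hits)

        hits = [
            {"subject": "prot_A", "taxon": "Listeria monocytogenes", "evalue": 1e-50},
            {"subject": "prot_B", "taxon": "Listeria innocua", "evalue": 0.02},  # too weak to count
        ]
        print(is_taxon_specific(hits, "Listeria monocytogenes"))  # -> True at the species level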

    A framework for application of metabolic modeling in yeast to predict the effects of nsSNV in human orthologs

    Get PDF
    Background: We have previously suggested a method for proteome-wide analysis of variation at functional residues, wherein we identified the set of all human genes with nonsynonymous single nucleotide variation (nsSNV) in the active site residue of the corresponding proteins. 34 of these proteins were shown to have a 1:1:1 enzyme:pathway:reaction relationship, making them ideal candidates for laboratory validation through the creation and observation of specific yeast active-site knock-outs and downstream targeted metabolomics experiments. Here we present the next step in the workflow toward using yeast metabolic modeling to predict human metabolic behavior resulting from nsSNV. Results: For the previously identified candidate proteins, we used the reciprocal best BLAST hits method followed by manual alignment and pathway comparison to identify 6 human proteins with yeast orthologs which were suitable for flux balance analysis (FBA). 5 of these proteins are known to be associated with diseases, including ribose-5-phosphate isomerase deficiency, myopathy with lactic acidosis and sideroblastic anaemia, anemia due to disorders of glutathione metabolism, and two porphyrias, and we suspect the sixth enzyme to have disease associations which are not yet classified or understood based on the work described herein. Conclusions: Preliminary findings using the Yeast 7.0 FBA model show lack of growth for only one enzyme knockout, but augmentation of the Yeast 7.0 biomass function to better simulate knockout of certain genes suggested physiological relevance of variations in three additional proteins. Thus, we suggest the following four proteins for laboratory validation: delta-aminolevulinic acid dehydratase, ferrochelatase, ribose-5-phosphate isomerase and mitochondrial tyrosyl-tRNA synthetase. This study indicates that the predictive ability of this method will improve as more advanced, comprehensive models are developed. Moreover, these findings will be useful in the development of simple downstream biochemical or mass-spectrometric assays to corroborate these predictions and detect the presence of certain known nsSNVs with deleterious outcomes. Results may also be useful in predicting as yet unknown outcomes of active site nsSNVs for enzymes that are not yet well classified or annotated.
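
    A minimal sketch of the knockout simulation idea, assuming the COBRApy package and a locally available SBML copy of the Yeast 7 model, is shown below. The model file name and the systematic gene identifier are placeholders for this example; the study's own workflow and parameter choices are not reproduced here.

        # Hedged sketch of an FBA gene-knockout simulation with COBRApy.
        # The model path and gene ID are placeholders, not the study's exact inputs.
        import cobra

        model = cobra.io.read_sbml_model("yeast_7.xml")   # local SBML copy, assumed

        with model:                                       # changes are reverted on exit
            model.genes.get_by_id("YGL040C").knock_out()  # placeholder systematic gene name
            growth = model.optimize().objective_value

        print("predicted growth rate after knockout:", growth)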